[Bugfix] (qwen3_tts): enable batched offline inference by fixing tensor slicing #1417
RomanKoshkin wants to merge 7 commits into vllm-project:main
Conversation
enable batched offline inference by fixing tensor slicing. Signed-off-by: Roman Koshkin <roman.koshkin@sbintuitions.co.jp>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6dd5ad26bd
Force-pushed from 9cf6cab to 6dd5ad2.
enable batched offline inference by fixing tensor slicing. Signed-off-by: Roman Koshkin <roman.koshkin@sbintuitions.co.jp>
@lishunyang12 Thanks for the review. I've addressed almost all the comments (except the NPU one, as I can't properly test it on my side). I've re-run the tests for …
@lishunyang12 I've also pushed the updated code to another branch.
lishunyang12
left a comment
Nice cleanup — most comments addressed. A couple things still open.
enable batched offline inference by fixing tensor slicing. Signed-off-by: Roman Koshkin <roman.koshkin@sbintuitions.co.jp>
enable batched offline inference by fixing tensor slicing. Signed-off-by: Roman Koshkin <roman.koshkin@sbintuitions.co.jp>
Signed-off-by: Roman Koshkin <26285991+RomanKoshkin@users.noreply.github.com>
Can you add some performance tests here?
@tzhouam qwen-tts also uses `additional_info`
lishunyang12
left a comment
Looks good now — all the previous concerns are addressed. The list-of-tensors output in make_omni_output is handled correctly by the generation model runner's dict branch. Nice work on the batched inference fix.
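The routing the reviewer describes can be sketched roughly as follows. This is a simplified guess at what the runner's dict branch does with a list-of-tensors output; `map_outputs_to_requests` and its signature are hypothetical, not the actual `gpu_generation_model_runner.py` code:

```python
# Rough illustration of routing a list of per-request audio tensors back
# to their request IDs; the function name and shapes are hypothetical.
def map_outputs_to_requests(request_ids, multimodal_outputs):
    # The fixed make_omni_output returns one tensor per request, in batch
    # order, so a positional zip recovers the per-request mapping.
    if len(request_ids) != len(multimodal_outputs):
        raise ValueError("output count must match request count")
    return dict(zip(request_ids, multimodal_outputs))
```

With the pre-fix behavior (a single tensor for the whole batch), the length check above is exactly what would fail, which matches the worker crashes described in the PR.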
LGTM. Core batching logic is correct. Previous review feedback has been addressed.
A few minor (non-blocking) suggestions:

- Guard against empty `runtime_info_list`: if `runtime_info_list` is `[]`, `task_types[0]` will raise an uncontrolled `IndexError`. A quick early check would be cleaner.
- Warn on inconsistent non-batched kwargs: for scalar params like `max_new_tokens`, only the first request's value is used silently. Consider logging a warning if values differ across requests.
- Use a set for key filtering: `if k not in ["text", "task_type", ...]` → `if k not in {"text", "task_type", ...}` (minor, but cleaner).
- Extract `batched_keys` as a class constant: the set `{"ref_audio", "ref_text", ...}` encodes model-specific knowledge, so promoting it to a class-level constant with a brief comment improves maintainability.
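Taken together, the suggestions above might look something like this sketch. The class name, key names, and method body are illustrative stand-ins for the PR's actual `forward()` accumulation logic, not the real code:

```python
# Hypothetical sketch combining the reviewer's four suggestions; the
# class, key names, and helper method are illustrative only.
import logging

logger = logging.getLogger(__name__)

class Qwen3TTSBatchHelper:
    # Suggestion 4: model-specific batched keys as a class constant
    # (assumed key names) so the knowledge lives in one commented place.
    BATCHED_KEYS = {"text", "speaker", "language", "ref_audio", "ref_text"}
    # Suggestion 3: a set, not a list, for key filtering.
    NON_FORWARDED_KEYS = {"text", "task_type"}

    def collect(self, runtime_info_list):
        # Suggestion 1: guard against an empty batch before any [0] indexing.
        if not runtime_info_list:
            raise ValueError("runtime_info_list is empty")
        batched = {k: [] for k in self.BATCHED_KEYS}
        scalars = {}
        for info in runtime_info_list:
            for k, v in info.items():
                if k in self.BATCHED_KEYS:
                    batched[k].append(v)
                elif k not in self.NON_FORWARDED_KEYS:
                    # Suggestion 2: only the first request's scalar value is
                    # used; warn when later requests disagree.
                    if k in scalars and scalars[k] != v:
                        logger.warning("inconsistent value for %r; using first", k)
                    scalars.setdefault(k, v)
        return batched, scalars
```

The early `ValueError` makes the empty-batch failure explicit instead of surfacing as an `IndexError` deep in the forward pass.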
@vllm-omni-reviewer
Signed-off-by: Roman Koshkin <26285991+RomanKoshkin@users.noreply.github.com>
Purpose
This PR fixes two bugs in the `Qwen3-TTS` model wrapper that prevented offline batched inference (`max_batch_size > 1`) from working correctly. Previously, passing a batch of inputs resulted in the engine either broadcasting the first request's output to all requests or crashing the worker due to length mismatches.

Specifically, this PR addresses:

- In `Qwen3TTSModelForGeneration.forward()`, `runtime_additional_information` was hardcoded to pop index `[0]`. This has been updated to iterate over the `runtime_info_list` and properly accumulate batched inputs (texts, speakers, languages, `ref_audio`, etc.) into lists before passing them to the underlying generation methods.
- In `make_omni_output()`, the batched `audio_tensors` array was hardcoded to only convert and return `audio_tensors[0]`. This has been fixed to convert the entire list of arrays to PyTorch tensors and return the full list in `multimodal_outputs`, allowing `gpu_generation_model_runner.py` to correctly map unique audio tensors back to their respective request IDs.

Test Plan
Tested offline batched inference using `omni.generate()` with a batch size of 80 requests.

Since this is a fix to the core multimodal routing logic, no new automated tests are required. However, the fix can be manually verified using this minimal offline batching script:
Test Script
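The script body is collapsed in this view. As a stand-in, the sketch below checks the batching property the fix guarantees, with a stubbed `fake_generate()` in place of the real `omni.generate()`, which needs a loaded model to run. Every name here is illustrative, not the author's actual script:

```python
# Hypothetical stand-in for the collapsed test script; fake_generate()
# stubs omni.generate() so the batching property can be checked offline.
import numpy as np

def fake_generate(prompts):
    # Mimics the fixed make_omni_output path: a distinct audio array per
    # request, rather than element [0] broadcast to the whole batch.
    return [np.full(4, float(i)) for i, _ in enumerate(prompts)]

prompts = [f"utterance {i}" for i in range(80)]
outputs = fake_generate(prompts)

# One output per request, and not all identical to request 0's audio.
assert len(outputs) == len(prompts)
assert any(not np.array_equal(outputs[0], o) for o in outputs[1:])
print("each request maps to its own audio tensor")
```

Against a real engine, the same two assertions are what distinguish the fixed behavior from the pre-fix broadcasting bug.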